ABSTRACT

We attempted to improve recognition accuracy by reducing
the inadequacies of the lexicon and language model.
Specifically, we address the following three problems: (1)
the best size for the lexicon, (2) conditioning written text
for spoken language recognition, and (3) using additional
training outside the text distribution. We found that
increasing the lexicon from 20,000 words to 40,000 words
reduced the percentage of words outside the vocabulary
from over 2% to just 0.2%, thereby decreasing the error
rate substantially. The error rate on words already in the
vocabulary did not increase substantially. We modified
the language model training text by applying rules to
simulate the differences between the training text and what
people actually said. Finally, we found that using another
three years of training text, even without the appropriate
preprocessing, substantially improved the language
model. We also tested these approaches on spontaneous
news dictation and found similar improvements.

1.  INTRODUCTION 

Speech recognition accuracy is affected as much by the
language model as by the acoustic model. In general, the word
error rate is roughly proportional to the square root of the
perplexity of the language model. In addition, in a natural
unlimited vocabulary task, a substantial portion of the word
errors come from words that are not even in the recognition
vocabulary. These out-of-vocabulary (OOV) words have no
chance of being recognized correctly. Thus, our goal is to
estimate a good language model from the available training
text, and to determine a vocabulary that is likely to cover
the test vocabulary.
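
The square-root rule of thumb above can be illustrated with a
minimal sketch. The function names and the perplexity values in
the usage note are hypothetical and chosen only for illustration;
the proportionality itself is an empirical approximation, not an
exact law.

```python
import math

def perplexity(log2_probs):
    """Perplexity from the per-word log2 probabilities that a
    language model assigns to a test text."""
    return 2.0 ** (-sum(log2_probs) / len(log2_probs))

def relative_wer(perplexity_new, perplexity_old):
    """Expected ratio of word error rates under the rule of thumb
    that WER is roughly proportional to sqrt(perplexity)."""
    return math.sqrt(perplexity_new / perplexity_old)
```

Under this approximation, cutting the perplexity from 400 to 100
(hypothetical values) would be expected to roughly halve the word
error rate, since relative_wer(100, 400) evaluates to 0.5.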

The straightforward solution to improving the language
model might be to increase the complexity of the model
(e.g., use a higher order Markov chain) and/or obtain more
language model training text. But this by itself will not
necessarily provide a better model, especially if the text is not
an ideal model of what people will actually say. The simple
solution to increase the coverage of the vocabulary is to
increase the vocabulary size. But this also increases the word
error rate and the computation and size of the recognition
process.

In this paper we consider several simple techniques for
improving the power of the language model. First, in Section
3, we explore the effect of increasing the vocabulary size on
recognition accuracy in an unlimited vocabulary task.
Second, in Section 4, we consider ways to model the differences
between the language model training text and the way
people actually speak. And third, in Section 5, we show that
simply increasing the amount of language model training
helps significantly.


2.  THE  WSJ  CORPUS 


The November 1993 ARPA Continuous Speech Recognition
(CSR) evaluation was based on speech and language taken
from the Wall Street Journal (WSJ). The standard language
model training text consisted of about 35 million
words of text extracted from the WSJ from 1987 to 1989.
The text was normalized (preprocessed) with a model of
the words people use when reading open text. For example,
“$234.56” was always assumed to be read as “two hundred
thirty four dollars and fifty six cents”. “March 13” was
always normalized as “March thirteenth”, not “March the
thirteenth” or “March thirteen”. And so on.

The  original  processed  text  contains  about  160,000  unique 
words.  However,  many  of  these  are  due  to  misspellings. 
Therefore,  the  test  corpus  was  limited  to  those  sentences  that 
consisted  only  of  the  most  likely  64,000  words.  While  this 
vocabulary  is  still  quite  large,  it  has  two  beneficial  effects. 
First,  it  greatly  reduces  the  number  of  misspellings  in  the 
texts.  Second,  it  allows  implementations  to  use  2-byte  data 
fields  to  represent  the  words  rather  than  having  to  use  4 
bytes. 

The  “standard”  recognition  vocabulary  was  defined  as  the 
most  likely  20,000  words  in  the  corpus.  Then,  the  standard 
language  model  was  defined  as  a  trigram  language  model 
estimated  specifically  for  these  20K  words.  This  standard 
model,  provided  by  Lincoln  Laboratory,  was  to  be  used  for 
the controlled portion of the recognition tests. In addition,
participants were encouraged to generate an improved
language model by any means (other than examining the test
data).
3.  RECOGNITION  LEXICON

We find that, typically, over 2% of the word occurrences in a
development set are not included in the standard 20K-word
vocabulary. Naturally, words that are not in the vocabulary
cannot be recognized accurately. (At best, we might
try to detect that there is one or more unknown words at
this point in a sentence, and then attempt to recognize the
phoneme sequence, and then guess a possible letter sequence
for this phoneme sequence. Unfortunately, in English, even
if we could recognize the phonemes perfectly, there are
many valid ways to spell a particular phoneme sequence.)
In addition to the OOV word itself not being recognized, we
often see that one or two words adjacent to the missing word
are also misrecognized. This is because the recognizer, in
choosing some word from its vocabulary, now also has the wrong
context for the following or preceding words. In general,
we find that the word error rate increases by about 1.5 to 2
times the rate of out-of-vocabulary (OOV) words.
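
As a worked illustration of this rule of thumb, the following
minimal sketch converts an OOV rate into the range of word error
it would be expected to contribute. The function name is
hypothetical; the factors 1.5 and 2 are the empirical values
quoted above.

```python
def oov_error_range(oov_rate_pct, low_factor=1.5, high_factor=2.0):
    """Range of absolute word error (in percent) attributable to OOV
    words, assuming each OOV word causes 1.5 to 2 word errors."""
    return (low_factor * oov_rate_pct, high_factor * oov_rate_pct)

# An OOV rate of 2% would be expected to cost 3 to 4 points of word error.
low, high = oov_error_range(2.0)
```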

One  simple  way  to  decrease  the  percentage  of  OOV  words 
is  to  increase  the  vocabulary  size.  But  which  words  should 
be  added?  The  obvious  solution  is  to  add  words  in  order  of 
their  relative  frequency  within  the  full  text  corpus.  There 
are  several  problems  that  might  result  from  this: 

1.  The vocabulary might have to be extremely large
before the OOV rate is reduced significantly.

2.  If the word error rate for the vast majority of the words
that are already in the smaller vocabulary increased by
even a small amount, it might offset any gain obtained
from reducing the OOV rate.

3.  The  language  model  probabilities  for  these  additional 
words  would  be  quite  low,  which  might  prevent  them 
from  being  recognized  anyway. 

We  did  not  have  phonetic  pronunciations  for  all  of  the  64K 
words.  We  sent  a  list  of  the  (approximately  34K)  words  for 
which  we  had  no  pronunciations  to  Boston  University.  They 
found  pronunciations  for  about  half  (18K)  of  the  words  in 
their  (expanded  Moby)  dictionary.  When  we  added  these 
words  to  our  WSJ  dictionary,  we  had  a  total  of  50K  words 
that  we  could  use  for  recognition. 

The following table shows the percentage of OOV words as a
function of the vocabulary size. The measurement was done
on the WSJ1 Hub1 “20K” development test, which has 2,464
unique words and a total count of 8,227 words. Because
phonetic pronunciations were unavailable for some words
(as mentioned above), the final vocabulary size is given in
the second column.
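
The measurement described here can be sketched as follows. This
is a minimal illustration, not the procedure actually used: the
function names, the whitespace tokenization, and the optional
pronunciation filter are assumptions.

```python
from collections import Counter

def build_vocabulary(train_tokens, size, has_pronunciation=None):
    """Take the `size` most frequent training words, then keep only
    those for which a pronunciation is available (if a set is given).
    The final vocabulary can therefore be smaller than `size`."""
    ranked = [word for word, _ in Counter(train_tokens).most_common(size)]
    if has_pronunciation is not None:
        ranked = [word for word in ranked if word in has_pronunciation]
    return set(ranked)

def oov_rate(test_tokens, vocabulary):
    """Percentage of test word occurrences that are not in the vocabulary."""
    missing = sum(1 for word in test_tokens if word not in vocabulary)
    return 100.0 * missing / len(test_tokens)
```

Calling build_vocabulary with increasing values of size and
passing each result to oov_rate gives the OOV percentage as a
function of the nominal and final vocabulary sizes.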

We were somewhat surprised to see that the percentage of
OOV words was reduced to only 0.17% when the lexicon
included the most likely 40K words, especially given that
many of the most likely words were not available because
we did not have phonetic pronunciations for them. Thus,
it was not necessary to increase the vocabulary above 40K
words.

The second worry was that increasing the vocabulary by too
much might increase the word error rate due to the increased
number of choices. For example, normally, if we double the
vocabulary, we might expect an increase in word error rate
of about 40%! So we performed an experiment in which
we used the standard 20K language model for the 5K
development data. We found, to our surprise, that the error
rate increased only slightly, from 8.7% to 9.3%. Therefore,
we felt confident that we could increase the vocabulary as
needed.

We considered possible explanations for the small increase
in error due to a larger vocabulary. We realized that the
answer was in the language model. In the usual case, when we
simply increase the vocabulary, the new words have the
same probability in the language model as the old words.
In this case, however, all the new words that were added
had lower probabilities (at least for the unigram model) than
the existing words. Let us consider two cases, in each of
which we would not falsely substitute a new word for an old
one. If the new word were acoustically similar to one of the
words in the test (and therefore similar to a word in the
original vocabulary), then the word would be correctly
recognized because the original word would always have a higher
language model probability. If, on the other hand, the new
word were acoustically very different from the word being
spoken, then we might expect that our acoustic models would
prevent the new word from being chosen over the old word.
While the argument makes some sense, we did not expect the
loss for increasing the vocabulary from 5K words to 20K
words to be so small.

Finally,  the  third  question  is  whether  the  new  words  would 
be  recognized  when  they  did  occur,  since  (as  mentioned 
above)  their  language  model  probabilities  were  generally 
low.  In  fact,  we  found  that,  even  though  the  error  rate  for 
these  new  words  was  higher  than  for  the  more  likely  words, 
we  were  still  able  to  recognize  about  50%  to  70%  of  them 
correctly,  presumably  based  largely  on  the  acoustic  model. 
Thus,  the  net  effect  of  this  was  to  reduce  the  word  error  rate 
by  about  1%  to  1.5%,  absolute. 
4.  MODELING  SPOKEN  LANGUAGE

Another effect that we worked on was the difference between
the processed text, as defined by the preprocessor, and the
words that people actually used when reading WSJ text. In
the pilot WSJ corpus, the subjects were prompted with texts
that had already been “normalized”, so that there was no
ambiguity about how to read a sentence. However, in the
WSJ1 corpus, subjects were instructed to read the original
texts and to say whatever seemed most appropriate to them.
Since the WSJ1 prompting texts were not normalized to
deterministic word sequences, subjects showed considerable
variability in their reading of the prompting text.

However, the standard language model was derived from the
normalized text produced by the preprocessor. This resulted
in a mismatch between the language model and the actual
word sequences that were spoken. While the preprocessor
was quite good at predicting what people said most of the
time, there were several cases where people used different
words than predicted. For example, the preprocessor
predicted that strings like “$234” would be read as “two
hundred thirty four dollars”. But in fact, most people read this
as “two hundred AND thirty four dollars”. For another
extreme example, the preprocessor’s prediction of “10.4” was
“ten point four”, but the subject (in the WSJ1 development
data) read this as “ten and four tenths”. There were many
other similar examples.

The standard model for the tests was the “nonverbalized
punctuation” (NVP) model, which assumes that the readers
never speak any of the punctuation words. The other model
that had been defined was the “verbalized punctuation” (VP)
model, which assumed that all of the punctuation was read
out loud. This year, the subjects were instructed that they
were free to read the punctuation out loud or not, in
whatever way they felt most comfortable. It turns out that people
didn’t verbalize most punctuation. However, they regularly
verbalized quotation marks in many different ways, all of
which differed from the ways predicted by the standard
preprocessor.

There were also several words that were read differently by
subjects. For example, subjects pronounced abbreviations
like “CORP.” and “INC.” as written, while the preprocessor
assumed that all abbreviations would be read as full words.

We used two methods to model the ways people actually
read text. The simpler approach was to include the text of
the acoustic training data in the language model training.
That is, we simply added the 37K sentence transcriptions
from the acoustic training to the 2M sentences of training
text. The advantage of this method is that it modeled what
people actually said. The system was definitely more likely
to recognize words or sequences that were previously
impossible. The problem with this method was that the amount of
transcribed speech was quite small (about 50 times smaller)
compared to the original training text. We tried repeating
the transcriptions several times, but we found that the effect
was not as strong as we would like.
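
Repeating the transcriptions amounts to multiplying their n-gram
counts before the two sources are pooled. The following is a
minimal sketch of that idea, not the training procedure actually
used; the function names, the `repeat` parameter, and the
sentence boundary symbols <s> and </s> are assumptions.

```python
from collections import Counter

def trigram_counts(sentences):
    """Count word trigrams, padding each sentence with boundary symbols."""
    counts = Counter()
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts

def pooled_counts(text_sentences, transcript_sentences, repeat=1):
    """Pool counts from the written text and the transcriptions,
    counting every transcription `repeat` times."""
    pooled = trigram_counts(text_sentences)
    for trigram, n in trigram_counts(transcript_sentences).items():
        pooled[trigram] += repeat * n
    return pooled
```

Because the transcriptions are about 50 times smaller than the
text, even a moderate value of `repeat` leaves most trigram
estimates dominated by the written text, which is consistent
with the limited effect reported above.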


A  more  powerful  approach  was  to  simulate  the  effects  of  the 
different  word  choices  by  simple  rules  which  were  applied 
to  all  of  the  35M  words  of  language  training  text.  We  chose 
to  use  the  following  rules: 


Preprocessed Text        Simulated Text
HUNDRED [number]         HUNDRED AND [number]
ONE HUNDRED              A HUNDRED
ONE DOLLAR               A DOLLAR
ZERO POINT [number]      POINT [number]
AND ONE HALF             AND A HALF
AND ONE QUARTER          AND A QUARTER

Thus, for example, if the sentence contains the pattern
“hundred twenty”, we repeated the same sentence with
“hundred AND twenty”.
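
The rule application can be sketched with regular expressions.
This is a minimal illustration under several assumptions: the
list of number words standing in for [number], the order in which
the rules are applied, applying all rules cumulatively to one
copy of the sentence, and the function names are not taken from
the original system.

```python
import re

# Assumed set of words that can fill the [number] slot in the rules.
NUMBER_WORDS = (
    "ZERO|ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE|TEN|ELEVEN|TWELVE|"
    "THIRTEEN|FOURTEEN|FIFTEEN|SIXTEEN|SEVENTEEN|EIGHTEEN|NINETEEN|"
    "TWENTY|THIRTY|FORTY|FIFTY|SIXTY|SEVENTY|EIGHTY|NINETY"
)

# (pattern, replacement) pairs mirroring the rule table.
RULES = [
    (re.compile(rf"\bHUNDRED (?=(?:{NUMBER_WORDS})\b)"), "HUNDRED AND "),
    (re.compile(r"\bONE HUNDRED\b"), "A HUNDRED"),
    (re.compile(r"\bONE DOLLAR\b"), "A DOLLAR"),
    (re.compile(rf"\bZERO POINT (?=(?:{NUMBER_WORDS})\b)"), "POINT "),
    (re.compile(r"\bAND ONE HALF\b"), "AND A HALF"),
    (re.compile(r"\bAND ONE QUARTER\b"), "AND A QUARTER"),
]

def simulate_spoken(sentence):
    """Apply every rule in turn to a normalized (upper-case) sentence."""
    for pattern, replacement in RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence

def augment(corpus):
    """Yield each sentence, followed by its simulated form if it changed."""
    for sentence in corpus:
        yield sentence
        spoken = simulate_spoken(sentence)
        if spoken != sentence:
            yield spoken
```

Keeping the original sentence and adding the altered copy, rather
than replacing it, lets the language model retain both readings.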

The result was that about one fifth of the sentences in the
original corpus had some change reflecting a difference in
the way subjects read the original text. Each altered
sentence was added alongside its original, so the simulated
readings carried the same weight in training as the
corresponding original text.

We found that this preprocessing of the text was sufficient to
cover most of those cases where the readers said things
differently than the predictions. The recognition results showed
that the system now usually recognized the new word
sequences and abbreviations correctly.


5.  INCREASING  THE  LANGUAGE  MODEL 
TRAINING 

While 35M words may seem like a lot of data, it is not
enough to cover all of the trigrams that are likely to occur
in the testing data. So we considered other sources for
additional language modeling text. The only easily accessible
data available was an additional 3 years (1990 to 1992)
of WSJ data from the TIPSTER corpus produced by the
Linguistic Data Consortium (LDC).
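
One way to see whether a training corpus covers the test trigrams
is to measure the fraction of test trigram occurrences that also
appear in training. The following minimal sketch shows such a
measurement; it is an illustration only, and the function names
and whitespace tokenization are assumptions.

```python
def trigrams(words):
    """All consecutive word triples in a list of words."""
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

def trigram_coverage(train_sentences, test_sentences):
    """Percentage of test trigram occurrences seen in the training text."""
    seen = set()
    for sentence in train_sentences:
        seen.update(trigrams(sentence.split()))
    test = [t for sentence in test_sentences for t in trigrams(sentence.split())]
    if not test:
        return 0.0
    return 100.0 * sum(t in seen for t in test) / len(test)
```

Comparing the coverage obtained with and without the additional
text would indicate how many previously unseen trigrams the new
data supplies.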

However, there were two problems with using this data.
First, since the test data was known to come from 1987 to
1989, we were concerned that this might actually hurt
performance due to some differences in the topics during that
3-year period. Second, this text had not been normalized
with the preprocessor and we did not have available to us
the preprocessor that was used to transform the raw text into
word sequences.

We decided to use the new text with minimal processing.
The text was filtered to remove all tables, captions,
numbers, etc. We replaced each opening double quote
(“) with “QUOTE and the matching closing quote with ”UNQUOTE
or ”ENDQUOTE, which were the most common ways these
marks were spoken. No other changes were made. We just
used the raw text as it was. One benefit of this was that
abbreviations were left as they appeared in the text rather than
expanded. Any numbers, dates, dollar amounts, etc., were
just considered “unknown” words, and did not contribute to
the training. We assumed that we had sufficient examples
of numbers in the original text.
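
The minimal processing described above can be sketched as
follows. This is an illustration under stated assumptions, not
the filter actually used: the token spellings with a straight
quote, the choice of a single closing token, the <UNK> symbol,
treating any token containing a digit as unknown, and resetting
the quote state on every line are all hypothetical.

```python
OPEN_TOKEN = '"QUOTE'
CLOSE_TOKEN = '"UNQUOTE'  # "ENDQUOTE is the other closing form mentioned
UNKNOWN = "<UNK>"         # hypothetical symbol for unknown words

def condition_raw_text(line):
    """Replace double quotes by spoken-form tokens and map any token
    containing a digit (numbers, dates, amounts) to the unknown word."""
    tokens = []
    inside_quote = False
    # Separate quote marks from the neighboring words before splitting.
    for token in line.replace('"', ' " ').split():
        if token == '"':
            tokens.append(CLOSE_TOKEN if inside_quote else OPEN_TOKEN)
            inside_quote = not inside_quote
        elif any(ch.isdigit() for ch in token):
            tokens.append(UNKNOWN)
        else:
            tokens.append(token)
    return " ".join(tokens)
```

Abbreviations pass through unchanged in this sketch, which
matches the observation that leaving them unexpanded was a
benefit.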

We found that adding this additional language training data
reduced the word error by about 7% relative, indicating that
the original 35 million words were not sufficient for the
models we were using. Thus, the addition of plain text, even
though it was from a different three years, and had many gaps
due to apparent unknown words, still improved the recognition
accuracy considerably.

6.  RESULTS 

The following table shows the benefit of the enlarged 40K
lexicon and the enhanced language model training on the
OOV rate and the word error for the development test and
the evaluation test.


                % OOV          % Word Error
              20K     40K      20K     40K
Development   2.27    0.17     16.4    12.9
Evaluation    1.83    0.23     14.2    12.2

Surprisingly, the addition of three years of LM training (from
a period post-dating the test data) improved performance on
the utterances that were completely inside the vocabulary.
Evidently, even the common trigrams are poorly trained with
only the 35 million word WSJ0 corpus. Overall, our
modifications to the lexicon and grammar training reduced the
word error by 14-22%.
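
The 14-22% range is a relative reduction and can be checked from
the word error figures in the table above. The short sketch below
does the arithmetic; the function and variable names are
hypothetical.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline

# Word error (%) with the 20K and 40K systems, taken from the table.
development = relative_reduction(16.4, 12.9)  # about 21.3
evaluation = relative_reduction(14.2, 12.2)   # about 14.1
```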

7.  SPONTANEOUS  DICTATION

Another area we investigated was spontaneous dictation.
The subjects were primarily former or practicing
journalists with some experience at dictation. They were instructed
to dictate general and financial news stories that would be
appropriate for a newspaper like WSJ. In general, the
journalists chose topics of recent interest. This meant that the
original language model was often out of date for the
subject. As a result, the percentage of OOV words increased
(to about 4%), and the language model taken from WSJ text
was less appropriate.

The  OOV  words  in  the  spontaneous  data  were  more  likely 
to  be  proper  nouns  from  recent  events  that  were  not  covered 
by  the  LM  training  material.  To  counter  this,  we  added  all 
(1,028)  of  the  new  words  that  were  found  in  the  spontaneous 
portion of the acoustic training data in WSJ1. This mostly
included topical names (e.g., Hillary Rodham, NAFTA).

In order to account for some of the differences between the
read text and the spontaneous text, and to have language
model probabilities for the new words, we added the
training transcriptions of the spontaneous dictation (about 8K
sentences) to the LM training as well.

New weights for the new language model, HMM, and
Segmental Neural Network were all optimized on spontaneous
development test data. The table below shows that the OOV
rate remains near 1% even after the enlargement to a 41K
lexicon.


                      % OOV                  % Word Error
Test Set      20K   40K   41K              20K    41K
Development   2.9   1.4   [illegible]      -      [illegible]
Evaluation    4.8   1.9   [illegible]      24.7   [illegible]

As can be seen, increasing the vocabulary size from 20K
to 40K significantly reduced the OOV rate. It is important
to point out that in this case, we did not have the
benefit of a word frequency list for spontaneous speech, and
that the source of speech had an unlimited vocabulary. So
the reduction in OOV rate is certainly a fair, if not
pessimistic, estimate of the real benefit from increasing the
vocabulary. Adding the few new words observed in the
spontaneous speech also helped somewhat, but not nearly as
much. The sample of only 8,000 sentences is clearly not
sufficient to find all the new words that people might use.
Presumably, if the sample of spontaneous speech were large
enough to derive word frequencies, then we could choose a
much better list of 40K words with a lower OOV rate.

Overall, the 41K trigram reduces the word error by 23%
over the 20K standard trigram on the November ’93 CSR S9
evaluation test. We estimate that more than half of this gain
was due to the decreased percentage of OOV words, and the
remainder was due to the increased language model training,
including specific examples of spontaneous dictation.

8.  CONCLUSIONS 

We  found  the  following  interesting  results: 

•  Expanding  the  vocabulary  with  less  frequent  words 
does  not  substantially  increase  the  word  error  on  those 
words  already  in  the  vocabulary,  but  does  eliminate 
many  errors  due  to  OOV  words. 

•  Doubling the amount of language model training text
improves the language model, even though the text
comes from different years than the test, and even
though the text was not preprocessed into proper
lexical forms.

•  It is possible to improve the quality of the language
modeling text by modeling the differences between the
predicted reading style and some examples of actual
transcriptions.

•  Increasing the vocabulary size and language training
had a bigger effect on spontaneous speech than on
read speech.